The attached appendix "Rapid Object Detection Using a Boosted Cascade of Simple
Features" describes one example of a preferred embodiment of the present invention.
Abstract

This paper describes a machine learning approach for vi- 
sual object detection which is capable of processing images 
extremely rapidly and achieving high detection rates. This 
work is distinguished by three key contributions. The first 
is the introduction of a new image representation called the 
"Integral Image" which allows the features used by our detector
to be computed very quickly. The second is a learning
algorithm, based on AdaBoost, which selects a small num- 
ber of critical visual features from a larger set and yields 
extremely efficient classifiers [5] . The third contribution is 
a method for combining increasingly more complex classifiers
in a "cascade" which allows background regions of the
image to be quickly discarded while spending more computation
on promising object-like regions. The cascade can be
viewed as an object-specific focus-of-attention mechanism
which, unlike previous approaches, provides statistical guarantees
that discarded regions are unlikely to contain the object
of interest. In the domain of face detection the system
yields detection rates comparable to the best previous sys- 
tems. Used in real-time applications, the detector runs at 
15 frames per second without resorting to image differenc- 
ing or skin color detection. 

1. Introduction 

This paper brings together new algorithms and insights to 
construct a framework for robust and extremely rapid object
detection. This framework is demonstrated on, and in part 
motivated by, the task of face detection. Toward this end 
we have constructed a frontal face detection system which 
achieves detection and false positive rates which are equiv- 
alent to the best published results [14, 11, 13, 10, 1]. This 
face detection system is most clearly distinguished from 
previous approaches in its ability to detect faces extremely 
rapidly. Operating on 384 by 288 pixel images, faces are detected
at 15 frames per second on a conventional 700 MHz
Intel Pentium III. In other face detection systems, auxiliary 
information, such as image differences in video sequences, 
or pixel color in color images, has been used to achieve



Michael Jones 
michael.jones@compaq.com
Compaq Cambridge Research Lab 
One Cambridge Center 
Cambridge, MA 02142 

high frame rates. Our system achieves high frame rates 
working only with the information present in a single grey 
scale image. These alternative sources of information can 
also be integrated with our system to achieve even higher 
frame rates. 

There are three main contributions of our object detec- 
tion framework. We will introduce each of these ideas 
briefly below and then describe them in detail in subsequent 
sections. 

The first contribution of this paper is a new image repre- 
sentation called an integral image that allows for very fast 
feature evaluation. Motivated in part by the work of Papageorgiou
et al., our detection system does not work directly
with image intensities [9]. Like these authors we use a set 
of features which are reminiscent of Haar Basis functions 
(though we will also use related filters which are more com- 
plex than Haar filters). In order to compute these features 
very rapidly at many scales we introduce the integral im- 
age representation for images. The integral image can be 
computed from an image using a few operations per pixel. 
Once computed, any one of these Haar-like features can be
computed at any scale or location in constant time. 

The second contribution of this paper is a method for 
constructing a classifier by selecting a small number of important
features using AdaBoost [5]. Within any image sub-window
the total number of Haar-like features is very large,
far larger than the number of pixels. In order to ensure fast 
classification, the learning process must exclude a large ma- 
jority of the available features, and focus on a small set of 
critical features. Motivated by the work of Tieu and Viola, 
feature selection is achieved through a simple modification 
of the AdaBoost procedure: the weak learner is constrained 
so that each weak classifier returned can depend on only a 
single feature [15]. As a result each stage of the boosting 
process, which selects a new weak classifier, can be viewed 
as a feature selection process. AdaBoost provides an effective
learning algorithm and strong bounds on generalization
performance [12, 8, 9]. 

The third major contribution of this paper is a method 
for combining successively more complex classifiers in a 
cascade structure which dramatically increases the speed of 
the detector by focusing attention on promising regions of
the image. The notion behind focus of attention approaches 
is that it is often possible to rapidly determine where in an 
image an object might occur [16, 7, 1]. More complex processing
is reserved only for these promising regions. The
key measure of such an approach is the "false negative" rate 
of the attentional process. It must be the case that all, or 
almost all, object instances are selected by the attentional 
filter. 

We will describe a process for training an extremely sim- 
ple and efficient classifier which can be used as a "super- 
vised" focus of attention operator. The term supervised 
refers to the fact that the attentional operator is trained to 
detect examples of a particular class. In the domain of face 
detection it is possible to achieve fewer than 1% false neg- 
atives and 40% false positives using a classifier constructed 
from two Haar-like features. The effect of this filter is to
reduce by over one half the number of locations where the 
final detector must be evaluated. 

Those sub-windows which are not rejected by the initial
classifier are processed by a sequence of classifiers, each 
slightly more complex than the last. If any classifier rejects 
the sub-window, no further processing is performed. The 
structure of the cascaded detection process is essentially 
that of a degenerate decision tree, and as such is related to 
the work of Geman and colleagues [1,3]. 

An extremely fast face detector will have broad prac- 
tical applications. These include user interfaces, image 
databases, and teleconferencing. In applications where 
rapid frame-rates are not necessary, our system will allow 
for significant additional post-processing and analysis. In 
addition our system can be implemented on a wide range of 
small low power devices, including hand-helds and embed- 
ded processors. In our lab we have implemented this face 
detector on the Compaq iPaq handheld and have achieved 
detection at two frames per second (this device has a low 
power 200 MIPS Strong Arm processor which lacks float- 
ing point hardware). 

The remainder of the paper describes our contributions 
and a number of experimental results, including a detailed 
description of our experimental methodology. Discussion 
of closely related work takes place at the end of each sec- 
tion. 

2. Features 

Our object detection procedure classifies images based on 
the value of simple features. There are many motivations 
for using features rather than the pixels directly. The most 
common reason is that features can act to encode ad-hoc 
domain knowledge that is difficult to learn using a finite 
quantity of training data. For this system there is also a 
second critical motivation for features: the feature based 
system operates much faster than a pixel-based system. 

The simple features used are reminiscent of Haar basis 
functions which have been used by Papageorgiou et al. [9]. 




Figure 1: Example rectangle features shown relative to the 
enclosing detection window. The sum of the pixels which 
lie within the white rectangles is subtracted from the sum
of pixels in the grey rectangles. Two-rectangle features are 
shown in (A) and (B). Figure (C) shows a three-rectangle 
feature, and (D) a four-rectangle feature. 

More specifically, we use three kinds of features. The value 
of a two-rectangle feature is the difference between the sum 
of the pixels within two rectangular regions. The regions 
have the same size and shape and are horizontally or ver- 
tically adjacent (see Figure 1). A three-rectangle feature 
computes the sum within two outside rectangles subtracted 
from the sum in a center rectangle. Finally a four-rectangle 
feature computes the difference between diagonal pairs of 
rectangles. 

Given that the base resolution of the detector is 24x24, 
the exhaustive set of rectangle features is quite large, over 
180,000. Note that unlike the Haar basis, the set of rectangle
features is overcomplete (see footnote 1).

2.1. Integral Image 

Rectangle features can be computed very rapidly using an 
intermediate representation for the image which we call the 
integral image (see footnote 2). The integral image at location x, y contains
the sum of the pixels above and to the left of x, y, inclusive: 

$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y'),$$

where ii(x, y) is the integral image and i(x, y) is the original
image. Using the following pair of recurrences:

$$s(x, y) = s(x, y - 1) + i(x, y) \tag{1}$$
$$ii(x, y) = ii(x - 1, y) + s(x, y) \tag{2}$$

(where s(x, y) is the cumulative row sum, s(x, -1) = 0,
and ii(-1, y) = 0) the integral image can be computed in
one pass over the original image.
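
The one-pass computation can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments; the function name `integral_image` and the list-of-rows layout `img[y][x]` are assumptions made for the example.

```python
def integral_image(img):
    """Return ii with ii[y][x] = sum of img[y'][x'] over all x' <= x, y' <= y.

    `img` is a list of rows indexed as img[y][x] (layout assumed for this sketch).
    """
    height = len(img)
    width = len(img[0]) if height else 0
    ii = [[0] * width for _ in range(height)]
    for x in range(width):
        s = 0  # boundary condition s(x, -1) = 0
        for y in range(height):
            s += img[y][x]                       # recurrence (1)
            left = ii[y][x - 1] if x > 0 else 0  # boundary condition ii(-1, y) = 0
            ii[y][x] = left + s                  # recurrence (2)
    return ii


print(integral_image([[1, 2, 3], [4, 5, 6]]))  # [[1, 3, 6], [5, 12, 21]]
```

Each pixel is visited once and costs two additions, which matches the "few operations per pixel" claim.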

1 A complete basis has no linear dependence between basis elements 
and has the same number of elements as the image space, in this case 576. 
The full set of 180,000 features is many times over-complete.

2 There is a close relation to "summed area tables" as used in graphics 
[2]. We choose a different name here in order to emphasize its use for the 
analysis of images, rather than for texture mapping. 
Figure 2: The sum of the pixels within rectangle D can be
computed with four array references. The value of the integral
image at location 1 is the sum of the pixels in rectangle
A. The value at location 2 is A + B, at location 3 is A + C,
and at location 4 is A + B + C + D. The sum within D can
be computed as 4 + 1 - (2 + 3).



Using the integral image any rectangular sum can be 
computed in four array references (see Figure 2). Clearly 
the difference between two rectangular sums can be com- 
puted in eight references. Since the two-rectangle features 
defined above involve adjacent rectangular sums they can 
be computed in six array references, eight in the case of 
the three-rectangle features, and nine for four-rectangle fea- 
tures. 
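
A short Python sketch of the four-reference rectangle sum and of a two-rectangle feature that reads six array entries follows. The zero-padded integral image (one extra row and column), the helper names, and the choice of which rectangle is subtracted from which are assumptions of this example, not details fixed by the text.

```python
def padded_integral_image(img):
    """ii[y][x] = sum of img over rows < y and columns < x (zero-padded; assumed layout)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum over the w-by-h rectangle with top-left pixel (x, y): 4 + 1 - (2 + 3)."""
    return ii[y + h][x + w] + ii[y][x] - (ii[y][x + w] + ii[y + h][x])


def two_rect_feature(ii, x, y, w, h):
    """Right rectangle minus left rectangle, each w-by-h and horizontally adjacent.

    The two rectangles share an edge, so only six distinct entries are read.
    """
    a, b, c = ii[y][x], ii[y][x + w], ii[y][x + 2 * w]
    d, e, f = ii[y + h][x], ii[y + h][x + w], ii[y + h][x + 2 * w]
    left = e + a - (b + d)
    right = f + b - (c + e)
    return right - left


ii = padded_integral_image([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
print(rect_sum(ii, 1, 1, 2, 2))          # 6 + 7 + 10 + 11 = 34
print(two_rect_feature(ii, 0, 0, 2, 3))  # 45 - 33 = 12
```

The cost of `rect_sum` does not depend on `w` or `h`, which is why features can be evaluated at any scale in constant time.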

2.2. Feature Discussion 

Rectangle features are somewhat primitive when compared 
with alternatives such as steerable filters [4, 6]. Steerable fil- 
ters, and their relatives, are excellent for the detailed analy- 
sis of boundaries, image compression, and texture analysis. 
In contrast rectangle features, while sensitive to the pres- 
ence of edges, bars, and other simple image structure, are 
quite coarse. Unlike steerable filters the only orientations 
available are vertical, horizontal, and diagonal. The set of 
rectangle features does however provide a rich image representation
which supports effective learning. In conjunction
with the integral image, the efficiency of the rectangle feature
set provides ample compensation for its limited flexibility.

3. Learning Classification Functions 

Given a feature set and a training set of positive and neg- 
ative images, any number of machine learning approaches 
could be used to learn a classification function. In our sys- 
tem a variant of AdaBoost is used both to select a small set 
of features and train the classifier [5]. In its original form, 
the AdaBoost learning algorithm is used to boost the clas- 
sification performance of a simple (sometimes called weak) 
learning algorithm. There are a number of formal guaran- 
tees provided by the AdaBoost learning procedure. Freund 
and Schapire proved that the training error of the strong
classifier approaches zero exponentially in the number of
rounds. More importantly a number of results were later 
proved about generalization performance [12]. The key 
insight is that generalization performance is related to the 
margin of the examples, and that AdaBoost achieves large 
margins rapidly. 

Recall that there are over 180,000 rectangle features as- 
sociated with each image sub-window, a number far larger 
than the number of pixels. Even though each feature can 
be computed very efficiently, computing the complete set is 
prohibitively expensive. Our hypothesis, which is borne out 
by experiment, is that a very small number of these features 
can be combined to form an effective classifier. The main 
challenge is to find these features. 

In support of this goal, the weak learning algorithm is 
designed to select the single rectangle feature which best 
separates the positive and negative examples (this is similar 
to the approach of [15] in the domain of image database 
retrieval). For each feature, the weak learner determines 
the optimal threshold classification function, such that the 
minimum number of examples are misclassified. A weak 
classifier h_j(x) thus consists of a feature f_j, a threshold θ_j
and a polarity p_j indicating the direction of the inequality
sign:

$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases}$$

Here x is a 24x24 pixel sub-window of an image. See Figure 3
for a summary of the boosting process.
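
As an illustration of the weak learner, the following Python sketch searches for the threshold and polarity that minimize the weighted error for a single feature. It is a brute-force version under stated assumptions: the function name `train_stump`, the candidate threshold set (each observed value plus one value above the maximum), and the tie-breaking order are choices made for the example, and the original system may well use a faster sorted scan.

```python
def train_stump(values, labels, weights):
    """Return (theta, polarity, error) minimizing the weighted error for one feature.

    The stump predicts 1 when polarity * value < polarity * theta, else 0.
    values[i] is the feature value on example i, labels[i] is 0 or 1,
    weights[i] is the current weight of example i.
    """
    best = (None, 1, float("inf"))
    # Assumed candidate thresholds: every observed value, plus one above the maximum
    # so that "everything is positive" is also representable.
    candidates = sorted(set(values))
    candidates.append(candidates[-1] + 1.0)
    for theta in candidates:
        for polarity in (1, -1):
            err = 0.0
            for v, y, w in zip(values, labels, weights):
                pred = 1 if polarity * v < polarity * theta else 0
                if pred != y:
                    err += w
            if err < best[2]:
                best = (theta, polarity, err)
    return best


print(train_stump([1.0, 2.0, 3.0, 4.0], [1, 1, 0, 0], [0.25] * 4))  # (3.0, 1, 0.0)
```

With polarity -1 the inequality flips, so the same code handles features whose large values indicate the positive class.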

In practice no single feature can perform the classifica- 
tion task with low error. Features which are selected in early 
rounds of the boosting process had error rates between 0.1
and 0.3. Features selected in later rounds, as the task be- 
comes more difficult, yield error rates between 0.4 and 0.5. 

3.1. Learning Discussion 

Many general feature selection procedures have been pro- 
posed (see chapter 8 of [17] for a review). Our final appli- 
cation demanded a very aggressive approach which would 
discard the vast majority of features. For a similar recognition
problem Papageorgiou et al. proposed a scheme for feature
selection based on feature variance [9]. They demonstrated
good results selecting 37 features out of a total of 1734
features.

Roth et al. propose a feature selection process based 
on the Winnow exponential perceptron learning rule [10]. 
The Winnow learning process converges to a solution where 
many of these weights are zero. Nevertheless a very large 
number of features are retained (perhaps a few hundred or 
thousand). 

3.2. Learning Results 

While details on the training and performance of the final 
system are presented in Section 5, several simple results 
merit discussion. Initial experiments demonstrated that a 
• Given example images (x_1, y_1), ..., (x_n, y_n) where
  y_i = 0, 1 for negative and positive examples respectively.

• Initialize weights $w_{1,i} = \frac{1}{2m}, \frac{1}{2l}$ for y_i = 0, 1 respectively,
  where m and l are the number of negatives and
  positives respectively.

• For t = 1, ..., T:

  1. Normalize the weights,

     $$w_{t,i} \leftarrow \frac{w_{t,i}}{\sum_{j=1}^{n} w_{t,j}}$$

     so that w_t is a probability distribution.

  2. For each feature, j, train a classifier h_j which
     is restricted to using a single feature. The
     error is evaluated with respect to w_t,
     $\epsilon_j = \sum_i w_{t,i} \, |h_j(x_i) - y_i|$.

  3. Choose the classifier, h_t, with the lowest error ε_t.

  4. Update the weights:

     $$w_{t+1,i} = w_{t,i} \, \beta_t^{1 - e_i}$$

     where e_i = 0 if example x_i is classified correctly,
     e_i = 1 otherwise, and $\beta_t = \frac{\epsilon_t}{1 - \epsilon_t}$.

• The final strong classifier is:

  $$h(x) = \begin{cases} 1 & \text{if } \sum_{t=1}^{T} \alpha_t h_t(x) \ge \frac{1}{2} \sum_{t=1}^{T} \alpha_t \\ 0 & \text{otherwise} \end{cases}$$

  where $\alpha_t = \log \frac{1}{\beta_t}$.

Figure 3: The AdaBoost algorithm for classifier learning.
Each round of boosting selects one feature from the
180,000 potential features.



frontal face classifier constructed from 200 features yields 
a detection rate of 95% with a false positive rate of 1 in 
14084. These results are compelling, but not sufficient for 
many real-world tasks. In terms of computation, this clas- 
sifier is probably faster than any other published system, 
requiring 0.7 seconds to scan a 384 by 288 pixel image.
Unfortunately, the most straightforward technique for improving
detection performance, adding features to the classifier,
directly increases computation time.
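
The boosting procedure summarized in Figure 3 can be sketched in Python as follows. This is a minimal illustration that assumes feature values have been precomputed into a table `features[j][i]` (feature j on example i). The names `best_stump`, `adaboost` and `classify`, the brute-force threshold search, and the clamping of the error away from 0 and 1 (to keep the logarithm finite) are assumptions of the sketch, not details taken from the paper.

```python
import math


def best_stump(values, labels, weights):
    """Brute-force (theta, polarity, error) with the lowest weighted error."""
    best = (0.0, 1, float("inf"))
    candidates = sorted(set(values))
    candidates.append(candidates[-1] + 1.0)
    for theta in candidates:
        for p in (1, -1):
            err = sum(w for v, y, w in zip(values, labels, weights)
                      if (1 if p * v < p * theta else 0) != y)
            if err < best[2]:
                best = (theta, p, err)
    return best


def adaboost(features, labels, rounds):
    """Return a list of (feature index, theta, polarity, alpha), one per round."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    w = [1.0 / (2 * n_pos) if y == 1 else 1.0 / (2 * n_neg) for y in labels]
    strong = []
    for _ in range(rounds):
        total = sum(w)
        w = [wi / total for wi in w]                      # step 1: normalize
        j, (theta, p, eps) = min(                         # steps 2 and 3
            ((j, best_stump(f, labels, w)) for j, f in enumerate(features)),
            key=lambda item: item[1][2])
        eps = min(max(eps, 1e-10), 1.0 - 1e-10)           # numerical guard (assumption)
        beta = eps / (1.0 - eps)
        for i, y in enumerate(labels):                    # step 4: reweight
            pred = 1 if p * features[j][i] < p * theta else 0
            if pred == y:
                w[i] *= beta                              # correct examples lose weight
        strong.append((j, theta, p, math.log(1.0 / beta)))
    return strong


def classify(strong, x):
    """x[j] is the value of feature j on the sub-window being classified."""
    score = sum(a for j, theta, p, a in strong if p * x[j] < p * theta)
    return 1 if score >= 0.5 * sum(a for _, _, _, a in strong) else 0


features = [[1.0, 2.0, 3.0, 4.0], [5.0, 1.0, 6.0, 2.0]]
labels = [1, 1, 0, 0]
strong = adaboost(features, labels, rounds=2)
print(classify(strong, [1.5, 0.0]), classify(strong, [3.5, 0.0]))  # 1 0
```

Because each round commits to exactly one feature, the number of rounds directly bounds the number of features the final classifier must evaluate.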

For the task of face detection, the initial rectangle fea- 
tures selected by AdaBoost are meaningful and easily inter- 
preted. The first feature selected seems to focus on the prop- 
erty that the region of the eyes is often darker than the region 
of the nose and cheeks (see Figure 4). This feature is rel- 
atively large in comparison with the detection sub-window, 
and should be somewhat insensitive to size and location of 
the face. The second feature selected relies on the property 
that the eyes are darker than the bridge of the nose. 




Figure 4: The first and second features selected by AdaBoost.
The two features are shown in the top row and then
overlaid on a typical training face in the bottom row. The
first feature measures the difference in intensity between the 
region of the eyes and a region across the upper cheeks. The 
feature capitalizes on the observation that the eye region is 
often darker than the cheeks. The second feature compares 
the intensities in the eye regions to the intensity across the 
bridge of the nose. 

4. The Attentional Cascade 

This section describes an algorithm for constructing a cas- 
cade of classifiers which achieves increased detection per- 
formance while radically reducing computation time. The 
key insight is that smaller, and therefore more efficient, 
boosted classifiers can be constructed which reject many of 
the negative sub-windows while detecting almost all posi- 
tive instances (i.e. the threshold of a boosted classifier can 
be adjusted so that the false negative rate is close to zero). 
Simpler classifiers are used to reject the majority of sub- 
windows before more complex classifiers are called upon 
to achieve low false positive rates. 

The overall form of the detection process is that of a de- 
generate decision tree, what we call a "cascade" (see Fig- 
ure 5). A positive result from the first classifier triggers the 
evaluation of a second classifier which has also been ad- 
justed to achieve very high detection rates. A positive result 
from the second classifier triggers a third classifier, and so 
on. A negative outcome at any point leads to the immediate 
rejection of the sub-window. 
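
The early-rejection behaviour can be made concrete with a short Python sketch. The data layout is hypothetical: each stage is a pair of a list of weak classifiers `(feature_fn, theta, polarity, alpha)` and a stage threshold, and `feature_fn` is any callable that returns a feature value for a window.

```python
def cascade_classify(stages, window):
    """Return True only if every stage accepts `window`.

    stages: list of (weak_classifiers, stage_threshold) pairs (assumed layout).
    """
    for weak_classifiers, stage_threshold in stages:
        score = 0.0
        for feature_fn, theta, polarity, alpha in weak_classifiers:
            if polarity * feature_fn(window) < polarity * theta:
                score += alpha
        if score < stage_threshold:
            return False  # negative outcome: reject without running later stages
    return True


# Two toy stages; a "window" here is just a pair of numbers.
stages = [
    ([(lambda w: w[0], 5, 1, 1.0)], 0.5),    # accepts when w[0] < 5
    ([(lambda w: w[1], 2, -1, 1.0)], 0.5),   # accepts when w[1] > 2
]
print(cascade_classify(stages, (1, 3)))  # True
print(cascade_classify(stages, (9, 3)))  # False, rejected by the first stage
```

Lowering a stage threshold makes that stage accept more windows, which is how a stage is tuned toward very high detection rates.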

Stages in the cascade are constructed by training clas- 
sifiers using AdaBoost and then adjusting the threshold to 
minimize false negatives. Note that the default AdaBoost 
threshold is designed to yield a low error rate on the train- 
ing data. In general a lower threshold yields higher detec- 
tion rates and higher false positive rates. 

For example an excellent first stage classifier can be con- 
structed from a two-feature strong classifier by reducing the 
threshold to minimize false negatives. Measured against a 
validation training set, the threshold can be adjusted to de- 
tect 100% of the faces with a false positive rate of 40%. See 
Figure 5: Schematic depiction of the detection cascade.
A series of classifiers are applied to every sub-window. The
initial classifier eliminates a large number of negative examples
with very little processing. Subsequent layers eliminate
additional negatives but require additional computation. After
several stages of processing the number of sub-windows
has been reduced radically. Further processing can take
any form such as additional stages of the cascade (as in our 
detection system) or an alternative detection system. 

Figure 4 for a description of the two features used in this 
classifier. 

Computation of the two feature classifier amounts to 
about 60 microprocessor instructions. It seems hard to 
imagine that any simpler filter could achieve higher rejec- 
tion rates. By comparison, scanning a simple image tem- 
plate, or a single layer perceptron, would require at least 20 
times as many operations per sub-window. 

The structure of the cascade reflects the fact that 
within any single image an overwhelming majority of sub- 
windows are negative. As such, the cascade attempts to re- 
ject as many negatives as possible at the earliest stage pos- 
sible. While a positive instance will trigger the evaluation 
of every classifier in the cascade, this is an exceedingly rare 
event. 

Much like a decision tree, subsequent classifiers are 
trained using those examples which pass through all the 
previous stages. As a result, the second classifier faces a 
more difficult task than the first. The examples which make 
it through the first stage are "harder" than typical exam- 
ples. The more difficult examples faced by deeper classi- 
fiers push the entire receiver operating characteristic (ROC) 
curve downward. At a given detection rate, deeper classi- 
fiers have correspondingly higher false positive rates. 

4.1. Training a Cascade of Classifiers 

The cascade training process involves two types of trade- 
offs. In most cases classifiers with more features will 
achieve higher detection rates and lower false positive rates. 
At the same time classifiers with more features require more 
time to compute. In principle one could define an optimiza- 
tion framework in which: i) the number of classifier stages, 
ii) the number of features in each stage, and iii) the thresh- 
old of each stage, are traded off in order to minimize the 
expected number of evaluated features. Unfortunately finding
this optimum is a tremendously difficult problem.

In practice a very simple framework is used to produce 
an effective classifier which is highly efficient. Each stage 
in the cascade reduces the false positive rate and decreases 
the detection rate. A target is selected for the minimum 
reduction in false positives and the maximum decrease in 
detection. Each stage is trained by adding features until the 
target detection and false positive rates are met (these rates
are determined by testing the detector on a validation set). 
Stages are added until the overall target for false positive 
and detection rate is met. 
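
To see how per-stage targets relate to the overall target, consider the following back-of-the-envelope Python sketch. It assumes that every stage meets the same targets and that each stage's rates are measured on the windows that reach it, so the overall rates are products of the per-stage rates. The function name and the numbers in the usage line are hypothetical and are not the targets used in our experiments.

```python
def stages_needed(f_stage, d_stage, f_target):
    """Count stages until the overall false positive rate drops to f_target.

    f_stage:  fraction of non-faces each stage lets through (0 < f_stage < 1)
    d_stage:  fraction of faces each stage keeps
    Returns (number of stages, overall false positive rate, overall detection rate).
    """
    k = 0
    f_overall, d_overall = 1.0, 1.0
    while f_overall > f_target:
        f_overall *= f_stage
        d_overall *= d_stage
        k += 1
    return k, f_overall, d_overall


k, f, d = stages_needed(f_stage=0.5, d_stage=0.99, f_target=1e-3)
print(k, f, round(d, 4))  # 10 0.0009765625 0.9044
```

The example shows why each stage must keep nearly all faces: small per-stage losses in detection compound across many stages.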

4.2. Detector Cascade Discussion 

The complete face detection cascade has 38 stages with over 
6000 features. Nevertheless the cascade structure results in 
fast average detection times. On a difficult dataset, con- 
taining 507 faces and 75 million sub-windows, faces are 
detected using an average of 10 feature evaluations per sub- 
window. In comparison, this system is about 15 times faster 
than an implementation of the detection system constructed 
by Rowley et al. [11] (see footnote 3).

A notion similar to the cascade appears in the face de- 
tection system described by Rowley et al. in which two de- 
tection networks are used [11]. Rowley et al. used a faster 
yet less accurate network to prescreen the image in order to 
find candidate regions for a slower more accurate network. 
Though it is difficult to determine exactly, it appears that 
Rowley et al.'s two network face system is the fastest existing
face detector (see footnote 4).

The structure of the cascaded detection process is es- 
sentially that of a degenerate decision tree, and as such is 
related to the work of Amit and Geman [1]. Unlike tech- 
niques which use a fixed detector, Amit and Geman propose 
an alternative point of view where unusual co-occurrences 
of simple image features are used to trigger the evaluation 
of a more complex detection process. In this way the full 
detection process need not be evaluated at many of the po- 
tential image locations and scales. While this basic insight 
is very valuable, in their implementation it is necessary to 
first evaluate some feature detector at every location. These 
features are then grouped to find unusual co-occurrences. In 
practice, since the form of our detector and the features that 
it uses are extremely efficient, the amortized cost of evalu- 
ating our detector at every scale and location is much faster 
than finding and grouping edges throughout the image. 

In recent work Fleuret and Geman have presented a face 
detection technique which relies on a "chain" of tests in or- 
der to signify the presence of a face at a particular scale and 

3 Henry Rowley very graciously supplied us with implementations of 
his detection system for direct comparison. Reported results are against 
his fastest system. It is difficult to determine from the published literature, 
but the Rowley-Baluja-Kanade detector is widely considered the fastest 
detection system and has been heavily tested on real-world problems.

4 Other published detectors have either neglected to discuss perfor- 
mance in detail, or have never published detection and false positive rates 
on a large and difficult training set. 







Figure 6: Example of frontal upright face images used for 
training. 

location [3]. The image properties measured by Fleuret and 
Geman, disjunctions of fine scale edges, are quite different 
than rectangle features which are simple, exist at all scales, 
and are somewhat interpretable. The two approaches also 
differ radically in their learning philosophy. The motivation 
for Fleuret and Geman's learning process is density estima- 
tion and density discrimination, while our detector is purely 
discriminative. Finally the false positive rate of Fleuret and 
Geman's approach appears to be higher than that of previous
approaches like Rowley et al. and this approach. Unfortunately
the paper does not report quantitative results of
this kind. The included example images each have between 
2 and 10 false positives. 

5. Results

A 38 layer cascaded classifier was trained to detect frontal 
upright faces. To train the detector, a set of face and non- 
face training images were used. The face training set con- 
sisted of 4916 hand labeled faces scaled and aligned to a 
base resolution of 24 by 24 pixels. The faces were ex- 
tracted from images downloaded during a random crawl of 
the world wide web. Some typical face examples are shown 
in Figure 6. The non-face subwindows used to train the 
detector come from 9544 images which were manually in- 
spected and found to not contain any faces. There are about 
350 million subwindows within these non-face images. 

The number of features in the first five layers of the de- 
tector is 1, 10, 25, 25 and 50 features respectively. The 
remaining layers have increasingly more features. The total 
number of features in all layers is 6061. 

Each classifier in the cascade was trained with the 4916 
training faces (plus their vertical mirror images for a total
of 9832 training faces) and 10,000 non-face sub-windows
(also of size 24 by 24 pixels) using the AdaBoost training
procedure. For the initial one feature classifier, the non- 
face training examples were collected by selecting random 
sub-windows from a set of 9544 images which did not con- 
tain faces. The non-face examples used to train subsequent 
layers were obtained by scanning the partial cascade across 
the non-face images and collecting false positives. A maxi- 
mum of 10,000 such non-face sub-windows were collected 
for each layer. 

Speed of the Final Detector 

The speed of the cascaded detector is directly related to 
the number of features evaluated per scanned sub-window. 
Evaluated on the MIT+CMU test set [11], an average of 10
features out of a total of 6061 are evaluated per sub-window.
This is possible because a large majority of sub-windows 
are rejected by the first or second layer in the cascade. On 
a 700 MHz Pentium III processor, the face detector can process
a 384 by 288 pixel image in about 0.067 seconds (using
a starting scale of 1.25 and a step size of 1.5 described
below). This is roughly 15 times faster than the Rowley-Baluja-Kanade
detector [11] and about 600 times faster than
the Schneiderman-Kanade detector [13].

Image Processing 

All example sub-windows used for training were vari- 
ance normalized to minimize the effect of different light- 
ing conditions. Normalization is therefore necessary during 
detection as well. The variance of an image sub-window 
can be computed quickly using a pair of integral images. 
Recall that $\sigma^2 = \frac{1}{N} \sum x^2 - m^2$, where σ is the standard
deviation, m is the mean, x is the pixel value within
the sub-window, and N is the number of pixels in the
sub-window. The mean of a sub-window can be computed
using the integral image. The sum of squared pixels
is computed using an integral image of the image squared 
(i.e. two integral images are used in the scanning process). 
During scanning the effect of image normalization can be 
achieved by post-multiplying the feature values rather than 
pre-multiplying the pixels. 
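
A minimal Python sketch of the two-integral-image variance computation follows. It relies on the standard identity that the variance equals the mean of the squared pixels minus the squared mean. The helper names and the zero-padded integral image layout are assumptions of the example.

```python
import math


def padded_integral(img, squared=False):
    """Zero-padded integral image of img, or of img squared when squared=True."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            v = img[y][x]
            row_sum += v * v if squared else v
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum over the w-by-h rectangle with top-left pixel (x, y), four references."""
    return ii[y + h][x + w] + ii[y][x] - ii[y][x + w] - ii[y + h][x]


def window_std(ii, ii_sq, x, y, w, h):
    """Standard deviation of a sub-window from eight array references."""
    n = w * h
    mean = rect_sum(ii, x, y, w, h) / n
    variance = rect_sum(ii_sq, x, y, w, h) / n - mean * mean
    return math.sqrt(max(variance, 0.0))  # clamp tiny negative rounding errors


img = [[1, 2], [3, 4]]
ii, ii_sq = padded_integral(img), padded_integral(img, squared=True)
print(window_std(ii, ii_sq, 0, 0, 2, 2))  # sqrt(7.5 - 6.25) = sqrt(1.25)
```

During scanning, a feature value computed on the raw image can then be scaled by the reciprocal of this standard deviation instead of normalizing every pixel of the sub-window first.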

Scanning the Detector 

The final detector is scanned across the image at multi- 
ple scales and locations. Scaling is achieved by scaling the 
detector itself, rather than scaling the image. This process 
makes sense because the features can be evaluated at any 
scale with the same cost. Good results were obtained using 
a set of scales a factor of 1.25 apart. 

The detector is also scanned across location. Subsequent
locations are obtained by shifting the window some number
of pixels Δ. This shifting process is affected by the scale of
the detector: if the current scale is s the window is shifted
by [sΔ], where [] is the rounding operation.

The choice of Δ affects both the speed and the accuracy of
the detector. The results we present are for Δ = 1.0.
We can achieve a significant speedup by setting Δ = 1.5
with only a slight decrease in accuracy.
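
The scanning procedure can be sketched as a Python generator. Several details here are assumptions of the sketch rather than statements about our implementation: the function and parameter names, rounding the window size to the nearest integer, a minimum step of one pixel, and the use of Python's built-in `round` (which rounds halves to the nearest even integer).

```python
def scan_windows(image_width, image_height, base=24,
                 scale_factor=1.25, delta=1.0, start_scale=1.0):
    """Yield (x, y, size, scale) for each sub-window visited by the scan."""
    scale = start_scale
    while True:
        size = int(round(base * scale))          # detector is scaled, not the image
        if size > image_width or size > image_height:
            break
        step = max(1, int(round(scale * delta)))  # shift of [s * delta] pixels
        for y in range(0, image_height - size + 1, step):
            for x in range(0, image_width - size + 1, step):
                yield x, y, size, scale
        scale *= scale_factor


print(len(list(scan_windows(26, 25))))  # 3 x 2 positions at scale 1.0 -> 6
```

A larger `delta` or `start_scale` reduces the number of windows visited, which is the source of the speedup mentioned above.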



Table 1: Detection rates for various numbers of false positives on the MIT+CMU test set containing 130 images and 507 faces.

                         False detections
Detector                 10      31      50      65      78       95      167
Viola-Jones              76.1%   88.4%   91.4%   92.0%   92.1%    92.9%   93.9%
Viola-Jones (voting)     81.1%   89.7%   92.1%   93.1%   93.1%    93.2%   93.7%
Rowley-Baluja-Kanade     83.2%   86.0%   -       -       -        89.2%   90.1%
Schneiderman-Kanade      -       -       -       94.4%   -        -       -
Roth-Yang-Ahuja          -       -       -       -       (94.8%)  -       -


Integration of Multiple Detections 

Since the final detector is insensitive to small changes in 
translation and scale, multiple detections will usually occur 
around each face in a scanned image. The same is often true 
of some types of false positives. In practice it often makes 
sense to return one final detection per face. Toward this end 
it is useful to postprocess the detected sub-windows in order 
to combine overlapping detections into a single detection. 

In these experiments detections are combined in a very 
simple fashion. The set of detections is first partitioned
into disjoint subsets. Two detections are in the same subset 
if their bounding regions overlap. Each partition yields a 
single final detection. The corners of the final bounding 
region are the average of the corners of all detections in the 
set. 
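
The partition-and-average step can be sketched in Python with a small union-find structure. The box format `(x1, y1, x2, y2)`, the strict-overlap test, and treating overlap transitively (so that chains of overlapping detections fall into one subset) are assumptions made for the example.

```python
def overlaps(a, b):
    """True when boxes (x1, y1, x2, y2) share some interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_detections(boxes):
    """Group overlapping boxes and return one corner-averaged box per group."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(boxes[i], boxes[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)
    # Average each of the four coordinates over the boxes in a group.
    return [tuple(sum(c) / len(g) for c in zip(*g)) for g in groups.values()]


boxes = [(0, 0, 10, 10), (2, 2, 12, 12), (100, 100, 110, 110)]
print(sorted(merge_detections(boxes)))
# [(1.0, 1.0, 11.0, 11.0), (100.0, 100.0, 110.0, 110.0)]
```

The pairwise overlap test is quadratic in the number of detections, which is acceptable here because only a small number of sub-windows survive the cascade.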

Experiments on a Real-World Test Set 

We tested our system on the MIT+CMU frontal face test 
set [11]. This set consists of 130 images with 507 labeled 
frontal faces. A ROC curve showing the performance of our 
detector on this test set is shown in Figure 7. To create the 
ROC curve the threshold of the final layer classifier is adjusted
from -∞ to +∞. Adjusting the threshold to +∞
will yield a detection rate of 0.0 and a false positive rate
of 0.0. Adjusting the threshold to -∞, however, increases
both the detection rate and false positive rate, but only to a
certain point. Neither rate can be higher than the rate of the
detection cascade minus the final layer. In effect, a threshold
of -∞ is equivalent to removing that layer. Further
increasing the detection and false positive rates requires de- 
creasing the threshold of the next classifier in the cascade. 
Thus, in order to construct a complete ROC curve, classifier 
layers are removed. We use the number of false positives as 
opposed to the rate of false positives for the x-axis of the 
ROC curve to facilitate comparison with other systems. To 
compute the false positive rate, simply divide by the total 
number of sub-windows scanned. In our experiments, the 
number of sub-windows scanned is 75,081,800. 
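
As a worked illustration of this conversion, take an operating point with 95 false detections (the choice of this particular count is ours, for the sake of the example):

$$\text{false positive rate} = \frac{95}{75{,}081{,}800} \approx 1.27 \times 10^{-6}$$

That is, roughly one false positive per 790,000 sub-windows scanned.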

Unfortunately, most previous published results on face 
detection have only included a single operating regime (i.e. 
single point on the ROC curve). To make comparison with 
our detector easier we have listed our detection rate for the 
false positive rates reported by the other systems. Table 1 
Figure 7: ROC curve for our face detector on the
MIT+CMU test set. The detector was run using a step size 
of 1.0 and starting scale of 1.0 (75,081,800 sub-windows 
scanned). 



lists the detection rate for various numbers of false detec- 
tions for our system as well as other published systems. For 
the Rowley-Baluja-Kanade results [11], a number of different
versions of their detector were tested, yielding a number
of different results; they are all listed under the same heading.
For the Roth-Yang-Ahuja detector [10], they reported
their result on the MIT+CMU test set minus 5 images containing
line-drawn faces.

Figure 8 shows the output of our face detector on some 
test images from the MIT+CMU test set. 

A simple voting scheme to further improve results 

In Table 1 we also show results from running three detectors
(the 38 layer one described above plus two similarly
trained detectors) and outputting the majority vote of the 
three detectors. This improves the detection rate as well as 
eliminating more false positives. The improvement would 
be greater if the detectors were more independent. The cor- 
relation of their errors results in a modest improvement over 
the best single detector. 



Figure 8: Output of our face detector on a number of test images from the MIT+CMU test set. 



6. Conclusions

We have presented an approach for object detection which 
minimizes computation time while achieving high detection 
accuracy. The approach was used to construct a face detection
system which is approximately 15 times faster than any
previous approach. 

This paper brings together new algorithms, representa- 
tions, and insights which are quite generic and may well 
have broader application in computer vision and image pro- 
cessing. 

Finally this paper presents a set of detailed experiments 
on a difficult face detection dataset which has been widely 
studied. This dataset includes faces under a very wide range 
of conditions including: illumination, scale, pose, and cam- 
era variation. Experiments on such a large and complex 
dataset are difficult and time consuming. Nevertheless sys- 
tems which work under these conditions are unlikely to be 
brittle or limited to a single set of conditions. More impor- 
tantly conclusions drawn from this dataset are unlikely to 
be experimental artifacts. 